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Abstract 

Genomic evolution can be highly heterogeneous. Here, we introduce a new framework to simulate genome-wide se- 
quence evolution under a variety of substitution models that may change along the genome and the phytogeny, following 
complex multispecies coalescent histories that can include recombination, demographics, longitudinal sampling, popu- 
lation subdivision/species history, and migration. A key aspect of our simulation strategy is that the heterogeneity of the 
whole evolutionary process can be parameterized according to statistical prior distributions specified by the user. We 
used this framework to carry out a study of the impact of variable codon frequencies across genomic regions on the 
estimation of the genome-wide nonsynonymous/synonymous ratio. We found that both variable codon frequencies 
across genes and rate variation among sites and regions can lead to severe underestimation of the global dN/dS 
values. The program SGWE — Simulation of Genome-Wide Evolution — is freely available from http://code.google.com/ 
p/sgwe-project/, including extensive documentation and detailed examples. 
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Computer simulations are important for different purposes 
in molecular evolution. For example, they can be used for 
hypothesis testing, to evaluate and validate analytical meth- 
ods, or to estimate evolutionary parameters (see Arenas 2012; 
Hoban et al. 2012). To our knowledge, only a few simulators of 
genome-wide evolution have been developed so far (table 1). 
Tools like EvolSimulator (Beiko and Charlebois 2007) and ALF 
(Dalquen et al. 2012) are able to actually simulate genomic 
events like duplication or rearrangement, while others basi- 
cally simulate multiple genomic regions. Except ALF, current 
genome-wide simulators assume a constant substitution pro- 
cess across the entire genome, which might seem overly sim- 
plistic (Arbiza et al. 201 1). In the case of ALF however, the user 
has to specify by hand the substitution model for any pre- 
defined genomic region, which can be too tedious for more 
than a few genes. Furthermore, in all these simulators, specific 
parameter values need to be specified by the user, which is 
not always an easy task. 

Here, we present a simulation framework called SGWE 
(Simulation of Genome-Wide Evolution) that is able to 
simulate multigene data sets accounting for heterogeneous 
evolution across genomic regions. Importantly, this heteroge- 
neity is controlled by the user through the specification of 
prior statistical distributions from which specific parameter 
values are sampled for each genomic region and replicate. 
Furthermore, evolutionary histories can be specified by 
the user or simulated by the multispecies coalescent 
with recombination — including hotspots and coldspots — 
demographics, and migration, among other evolutionary 



scenarios. We used this simulation framework to study the 
impact of variable codon frequencies across regions on the 
estimation of the dN/dS ratio. 

New Approaches: SGWE 

SGWE simulates genome-wide sequence evolution through 
the specification of genome-wide parameters and prior 
distributions for local parameters governing the evolution 
of the different genomic regions. Supplementary table SI, 
Supplementary Material online, shows a list of the different 
evolutionary scenarios that can be implemented in SGWE. 
The simulation procedure consists of two steps. In the first 
step, the user can specify every aspect of the simulation 
through a user-friendly Graphical User Interface (GUI), with 
the possibility of loading up to ten prespecified scenarios. The 
GUI window includes a series of frames where the user can 
define target evolutionary scenarios and the underlining prior 
distributions for the different parameters. In the second step, 
SGWE simulates each genomic region according to the spe- 
cific genome-wide and local parameters sampled from the 
prior distributions. Each simulated replicate consists of a set of 
aligned genomes (fig. 1). Internally, each genomic region is 
simulated under the multispecies coalescent with recombi- 
nation, including intracodon and hotspot recombination 
(Wiuf and Posada 2003; Arenas and Posada 2007; Arenas 
and Posada 2010), demographic periods, exponential 
growth, and several migration models with constant or 
time-dependent migration rates (Wright 1931; Kimura 
and Weiss 1964; Hudson 1998), longitudinal sampling 
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Region 1 Region 2 



Region 3 



Region 4 



Replicate 1 
Replicate 2 
Replicate 3 



K80+I GY94xM8xHKY ECMSchn2005+F GTR+I+Gsites 



Region 5 Region 6 




MG94xSYM+l HKYxCAT 



Fig. 1. Depiction of three genome alignments simulated with SGWE Each genome alignment contains six regions, printed with white and gray 
background to describe noncoding and coding regions, respectively. "+ I" indicates proportion of invariable sites, and "+ Gsites" indicates heterogeneity 
across sites according to a gamma distribution. "ECMSchn2005" indicates the empirical codon model by Schneider et al. (2005). "+ F" indicates empirical 
frequencies (e.g., user-specified) are considered. "CAT" indicates that frequencies change across sites within a region. 



(i.e., noncontemporaneous sequences) (Drummond et al. 
2002), under multiple nucleotide, codon, and protein substi- 
tution/replacement models. SGWE implements nucleotide 
substitution models like GTR (Tavare 1986) plus invariable 
sites and gamma-distributed rate variation among sites 
(i.e., the GTR+ I + G model) (Yang 1994) and special cases 
of it. In addition, the user can select codon models like 
GY94 x M0-M10 (Yang et al. 2000; Anisimova et al. 2001) 
where dN/dS can vary across branches, MG94 (Muse and 
Gaut 1994), Halpern and Bruno (HB) (Halpern and Bruno 
1998; Holder et al. 2008), or different empirical codon 
models (Schneider et al. 2005; Kosiol et al. 2007). Finally, 
SGWE also implements 16 empirical matrices and the CAT 
model (Lartillot and Philippe 2004) for amino acid replace- 
ment with variable frequencies across sites. 

Although SCWE implements a large variety of evolutionary 
scenarios and substitution models, it does not directly imple- 
ment indel evolution. This is mainly based on the complexity 
of simulating the coalescent with recombination (Hudson 
and Kaplan 1988) with indels, because the former requires a 
fixed sequence length. However, in a way that is transparent 
to the user, SGWE is able to call INDELible (Fletcher and Yang 
2009), a simulation software that implements a wide set of 
models of indel evolution along a fixed phylogeny. 

The SCWE pipeline is written in Java, C, Perl, and R and is 
freely available from http://code.google.eom/p/sgwe-project/ 
(last accessed March 4, 2014). The downloadable package 
includes executable files, source code, documentation, and 
a variety of practical examples. Furthermore, SGWE's coales- 
cent simulator can be used on its own on the command line 
for single locus simulations. This simulator is written in C, can 
run in parallel, and is freely available from http://code.google. 
com/p/coalevol/ (last accessed March 4, 2014). 

Benchmarking 

The implementation of SGWE was validated using theoretical 
expectations and/or comparisons with other simulation/ 
analytical software. For example: 

• Different simulation outcomes, like the time to the most 
recent common ancestor (TMRCA) or the number of re- 
combination events, were in agreement with theoretical 
expectations and with those obtained under the same 
settings using ms (Hudson 2002). 



• Simulated genealogies under diverse evolutionary scenar- 
ios were accurately reconstructed using Phyml (Guindon 
and Gascuel 2003). 

• Generating nucleotide and amino acid substitution models 
were correctly identified using jModelTest (Posada 2008) 
and ProtTest (Abascal et al. 2005), respectively. 

• Simulated dN/dS values were accurately estimated with 
PAML, Hyphy (Kosakovsky Pond et al. 2005), and SNAP 
(Korber 2000). 

Further details are given in supplementary note S1, 
Supplementary Material online. 

An Example: Influence of Heterogeneous 
Codon Frequencies and Substitution Rates 
on dN/dS Estimates 

To illustrate a potential use of SGWE, we studied the influence 
of variable transition/transversion rates ratio (ti/tv) and vari- 
able codon frequencies on the estimation of dN/dS (e.g., 
Oleksyk et al. 2010; Kjeldsen et al. 2012; Smith et al. 2013). 
Using SWGE, we simulated genome alignments were dN/dS 
was kept constant across the different genomic regions, but 
ti/tv and the codon frequencies varied among them. Then, we 
estimated dN/dS assuming that all parameters were constant 
along the different genomic regions. 

In the absence of rate variation among sites or regions, 
when only the ti/tv (fig. 2) or the GTR matrices (fig. 3 and 
supplementary figs. S1 and S2, Supplementary Material 
online, upper plots) varied across regions, the dN/dS estimates 
were very accurate. On the contrary, when the codon fre- 
quencies varied across regions, the global dN/dS was consis- 
tently underestimated (figs. 2 and 3 and supplementary 
figs. S1 and S2, Supplementary Material online; white bars). 
For example for simulated values of 2.0, 1.0, and 0.5, the av- 
erage dN/dS estimates were 1.25 ±0.04, 0.71 ±0.02, and 
0.43 ± 0.01, respectively (fig. 2). However, if the average of 
the local dN/dS estimates for each region was considered as 
an estimate of global dN/dS, the bias was not observed (figs. 2 
and 3 and supplementary figs. SI and S2, Supplementary 
Material online, upper plots; gray bars). 

Introducing rate variation among sites and regions resulted 
in a very complex picture, where different combination of 
parameters resulted in underestimates or overestimates of 
the simulated dN/dS value (fig. 3 and supplementary figs. SI 
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Ftc. 2. Influence of variable codon frequencies and variable ti/tv across regions on the estimation of the genome-wide dN/dS when the true dN/dS value 
is 0.5, 1.0, and 2.0. The horizontal dashed black line indicates the simulated dN/dS value. White bars indicate the estimated dN/dS from the entire 
genome, while the gray bars display the averaged dN/dS across regions. Error bars indicate 95% confidence intervals. 



and S2, Supplementary Material online). When codon fre- 
quencies were constant across regions, the M0 model resulted 
in accurate dN/dS estimates, but estimates under models M8 
and especially M5 were biased upward. When codon frequen- 
cies varied across regions, the dN/dS estimates were biased 
downward. In general, these biases were more pronounced 
when different regions had distinct levels of among-site 
rate variation (fig. 3 and supplementary figs. SI and S2, 
Supplementary Material online, lower plots). 

Discussion 

Sequence evolution across different genomic regions can be 
highly heterogeneous (e.g., Gibbs et al. 2007; Arbiza et al. 
2011). Simulation and empirical studies tend to ignore this 
heterogeneity and assume that multigene data sets evolve 
under one or very few substitution models. SCW£ imple- 
ments a simulation framework to simulate genome-wide se- 
quence evolution that accounts for evolutionary 
heterogeneity in time and (sequence) space, better reflecting 
the evolutionary process shaping real data. A key aspect of 
SGWE is that the heterogeneity of the whole evolutionary 



process can be parameterized according to statistical prior 
distributions specified by the user, allowing much needed 
flexibility. We believe that SCWE is complementary to other 
comprehensive tools like ALF, which implement a range of 
genomic events not included in SCWE but which cannot 
handle easily variation across regions and does not currently 
simulate population-genetic events such as recombination or 
lineage sorting within species trees. 

At this point, SGWE's coalescent simulator and INDELible 
cannot run at the same time in a given simulation experi- 
ment, so recombination simulations cannot be run with 
indels, for example. Which one to choose depends on the 
particular biological scenario that the user wants to imple- 
ment. In general, coalescent simulations should be more 
useful in intraspecific scenarios or in interspecific situations 
with incomplete lineage sorting, and phylogenetic simulations 
with INDELible should be more appropriate for interspecific 
evolution with no phylogenomic incongruence (i.e., where 
gene trees across the genome are equal). A detailed list of 
the capabilities implemented in SCWE is shown in the sup- 
plementary table S1, Supplementary Material online. 
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Fig. 3. Influence of variable codon frequencies, variable transition rates, and gamma-distributed rate variation among sites and across regions on the 
estimation of the genome-wide dN/dS when the true dN/dS value is 1.0. The horizontal dashed black line indicates the true, simulated value. White bars 
indicate the estimated dN/dS from the entire genome, while the gray bars display the averaged dN/dS across regions. Error bars indicate 95% confidence 
intervals. 



As an example of the use of SCWE, we studied the impact 
of variable codon frequencies and among-site rate variation 
across genomic regions on the estimation of dN/dS. It is well 
known that protein-coding sequences usually show variable 
frequencies across protein regions as a consequence of 
protein folding solvent accessibility, and protein function 
(e.g., Goldman et al. 1998; Lio and Goldman 1999; Liberies 
et al. 2012; Arenas et al. 2013). While different models 
of sequence evolution exist capable of accounting for this 
heterogeneity (Bruno 1996; Halpern and Bruno 1998; Pagel 
and Meade 2004; Holder et al. 2008), these are seldom used in 
real data — at least at the DNA and codon level — probably 
because they are computationally very intensive. Our 



simulation experiments with SCWE show that, in general, 
variable codon frequencies can result in the underestimation 
of genome-wide dN/dS values, while rate variation among 
sites and regions seem to have the opposite effect. On the 
other hand, the average of the regional estimates seems to be 
a good approximation of the genome-wide dN/dS value. The 
fact that model misspecification cause error in dN/dS estima- 
tion is hardly surprising, but our simulations confirm this 
expectation and more importantly quantify the bias. 
Indeed, most studies do not rely on a single, genome-wide 
dN/dS estimate, but they might still try to obtain single esti- 
mates from single genomic fragments that in fact could in- 
clude distinct substitution models (i.e., fragments that 
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encompass multiple genomic regions) and therefore be the 
subject of similar biases. 

Here, we are under a model underfitting scenario, where 
the model assumed for parameter estimation is always sim- 
pler than the true model used to simulate the data. It is 
known that this circumstance usually leads to parameter un- 
derestimation, for example of the branch lengths or of the 
ti/tv ratio (Tamura 1992; Lemmon and Moriarty 2004). In our 
simulations, the reasons why some model violations can 
induce underestimation of the global dN/dS and others over- 
estimation are not straightforward. Moreover, different mis- 
specifications of the assumed model operate here in opposite 
directions. Ignoring variable codon frequencies seems to push 
the dN/dS estimates downward. In particular, we could see 
that this was due to the simultaneous underestimation of dN 
and the overestimation of dS. Also, increasing the number of 
variable regions accentuated this bias (data not shown). 
Codon frequency biases have been shown before to induce 
underestimation of dN/dS for some ML methods (Yang and 
Nielsen 2000). Accordingly, variation in GC content along a 
sequence seems to reduce the number of true positives of the 
branch-site test (Gharib and Robinson-Rechavi 2013). 

On the other hand, ignoring rate variation among sites, 
especially when this change among regions, biased the dN/dS 
estimates upward under the M5 and M8 models, but not 
under the M0 model. The M5 model assumes a gamma dis- 
tribution for dN/dS variation among sites, while M8 adds to 
M5 a proportion of sites with dN/dS > 1. In the simulations, 
dN/dS was always constant across regions, but the bias ap- 
peared when the substitution rate changed within regions, 
and specially when it did it in different way in different regions 
(i.e., according to different gamma distributions). The exact 
reasons for this are not straightforward, although it is known 
that the M5 and M8 models can be less conservative than the 
M0 model (Yang et al. 2000; Metzger and Thomas 2010). 

Apart from simulation studies like the one implemented 
here, SCWE could also be used to benchmark species tree 
estimation, to understand the interactions between different 
evolutionary forces at the genome-wide level or to estimate 
evolutionary parameters and perform model choice using 
approximate Bayesian computation (Beaumont 2010; Lopes 
et al. 2014). 

Material and Methods 

Simulation of Variable Codon Frequencies, ti/tv, and 
Substitution Rates across Genomic Regions 
Gene genealogies for each genomic region were simulated 
under the coalescent assuming a constant effective popula- 
tion size of 1,000 and a sample size of 15 individuals. Each 
individual genome was composed of 15 genomic regions or 
genes, with 150 codons each. Genomic sequences were 
evolved over these genealogies assuming a GY94 x M0 
codon model under three genome-wide dN/dS values: 0.5, 
1, and 2. Transition/transversion (ti/tv) ratios were either 
fixed to 0.5 or varied across regions according to a Uniform 
distribution truncated between 0.5 and 15. Substitution rates 
(A-C, A-G, A-T, C-G, C-T, G-T) varied across regions according 



to a Dirichlet distribution D(6, 16,2,8,20,4) that was then scaled 
with the last rate. Scenarios with rate variation across sites 
were simulated according to a gamma distribution ( + G) 
with shape 0.7. Scenarios where this gamma distribution 
varied across regions drew the different gamma shapes 
from to an exponential distribution with mean 2.0 and trun- 
cated between 0.5 and 5.0. Such parameter values are typical 
of RNA virus like HIV-1 (Carvajal-Rodriguez et al. 2006). 
Codon frequencies were specified according to the nucleotide 
frequencies at each codon position. The latter were either 
constant (0.25 for each codon position) or varied across re- 
gions according to a Dirichlet distribution (D(1, 1,1,1) for each 
codon position). For each scenario, we simulated a total of 
100 genome alignments. 

Estimation of dN/dS 

Genome-wide dN/dS values were estimated using the codeml 
program from PAML (Yang 2007) under the GY94 x M0 
(constant dN/dS), GY94 x M5 (dN/dS follows a Gamma 
distribution), and GY94 x M8 (two categories, Beta 
distribution + dN/dS > 1) codon models (Yang et al. 2000). 
We choose PAML because it is a well-known, commonly used, 
comprehensive, and validated software to estimate dN/dS. 
The assumed codon frequencies were calculated as a func- 
tion of the empirical nucleotide frequencies at each codon 
position. As a sanity check, similar dN/dS estimates 
under GY94 x M0 codon model were obtained when 
we used Hyphy instead of PAML (supplementary fig. S4, 
Supplementary Material online). 

Supplementary Material 

Supplementary figures SI -S3 and tables S1-S6 are available at 
Molecular Biology and Evolution online (http://www.mbe. 
oxfordjournals.org/). 
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